QCDOC is a supercomputer designed for high scalability at a low cost per node. We discuss the status of
the project and provide performance estimates for large machines obtained from cycle accurate simulation of the 
QCDOC ASIC. 



Introduction 

QCDOC was designed for a cost-effective 
balance between memory bandwidth, floating 
point performance and communication perfor- 
mance when running massively parallelised dou- 
ble precision QCD codes. As shall be seen, our 
simulations suggest QCDOC should scale to ex- 
ceptionally large machine sizes on a fixed scientif- 
ically interesting lattice volume - or equivalently 
QCDOC will operate efficiently on very small lo- 
cal volumes. This gives the focussed compute 
power necessary to simulate much lighter quark 
masses than are accessible today, at a cost of 1 
US-dollar per sustained Megaflop. 

Hardware Overview 

The QCDOC design is based on an application
specific integrated circuit, or ASIC. We use
IBM System-On-a-Chip technology to integrate
all the node logic on a single silicon chip. The
CPU core is a PowerPC 440 whose FPU can
execute one operate instruction per clock cycle.
Peak performance of two flops per clock is
obtained by using fused multiply-add instructions.
The on-chip features include 4 MByte of embedded
DRAM memory and its controller, a memory
controller for off-chip DDR memory, an Ethernet
interface, and a custom serial communications
unit (SCU) linked by a number of on-chip busses.

The prefetching eDRAM controller (PEC) has
been custom designed by our colleagues at
IBM Research, and provides ample bandwidth
for the FPU. It automatically prefetches two
read streams, and is multi-ported, allowing
remote communication to overlap local
computation without contention.

The SCU hardware implements a six-dimensional
hyper-torus, with bi-directional low latency links.
The aggregate 1.5 GByte/s inter-node bandwidth
easily supports QCD with 1 Gflop of local floating
point all the way down to 2^4 local volumes.
The DMA engines in the SCU transfer complex
patterns between nodes while the local floating
point calculation proceeds uninterrupted.

The high dimensionality of the network both 
increases the communication bandwidth of the 
machine, and gives very symmetric local volumes 
when distributing the problem over very many 
nodes in four, or even five, dimensions. This 
yields the most favorable surface to volume ra- 
tio. 
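The surface to volume argument can be made concrete with a short calculation. The sketch below counts, per site, the average number of nearest-neighbour links that leave the node. The two node grids are illustrative assumptions (a 32^3 x 64 lattice on 4096 nodes, split in four or in three dimensions) and the function name is hypothetical.

```python
from math import prod


def off_node_links_per_site(local_dims, decomposed):
    """Average number of nearest-neighbour links per site that leave the node.

    local_dims: local lattice extent in each dimension.
    decomposed: per dimension, True if the lattice is split across nodes
                in that dimension (so both end faces are off-node).
    """
    volume = prod(local_dims)
    # Each decomposed dimension contributes two faces of volume / L sites.
    surface = sum(
        2 * volume // extent
        for extent, split in zip(local_dims, decomposed)
        if split
    )
    return surface / volume


# Global 32^3 x 64 lattice on 4096 nodes (illustrative node grids).
# 8 x 8 x 8 x 8 nodes give a local volume of 4 x 4 x 4 x 8.
four_d = off_node_links_per_site((4, 4, 4, 8), (True, True, True, True))
# 16 x 16 x 16 x 1 nodes give a local volume of 2 x 2 x 2 x 64.
three_d = off_node_links_per_site((2, 2, 2, 64), (True, True, True, False))
# The smallest symmetric case, a 2^4 local volume split in all four dimensions.
smallest = off_node_links_per_site((2, 2, 2, 2), (True, True, True, True))

print(four_d, three_d, smallest)
```

For the same number of sites per node, the four-dimensional split sends 1.75 links per site off-node while the three-dimensional split sends 3. In the 2^4 case every site has four off-node neighbours.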



* Presented by PAB at Lattice 2002, Boston. 



Table 1

Performance of double precision assembler kernels
produced using the speaker's code generator.
Very high fractions of the 1 Gflop raw peak,
and even higher fractions of the theoretical peak,
are obtained. Some compiled code figures are
included to show that reasonable performance can
be obtained without undue pain.

Operation          Mflops/node
SU3-SU3            800
SU3-2spinor        780
DAXPY              190
ZAXPY              450
DAXPY-Norm         350
Clover Term/asm    790
Clover Term/gcc    150
Clover Term/xlc    300



Hardware Status 

The QCDOC ASIC design is functionally com- 
plete, and the preliminary design release is un- 
dergoing timing driven physical layout at IBM 
Raleigh. Verification is on-going with release to 
manufacture expected by mid fall and first silicon 
at the end of this year. Large prototype machines 
are expected in spring 2003. 

Software Environment 

Two widespread and standard compilers are
being used, namely the IBM xlc compiler and the
GNU gcc C and C++ compilers. Debug facilities
are provided by a port of the full featured
RISCWatch remote debug tool.

The standard runtime support library is a
popular GNU Public License libc released by
Cygnus [2]. Runtime libraries are available for
performing communication over the physics network
via the SciDAC QMP interface, which is in turn
implemented on top of a native SCU interface.

Assembler optimised QCD kernels will be freely 
available. 

Node kernels 

The use of the PowerPC means that standard
operating systems could in principle be run on
the nodes. However, to scale to very small local
volumes it is essential that, in addition to very low
latency hardware, we avoid unnecessary software
latency.

[2] Popularised in the Cygwin software package

Table 2

Sample performance of red-black fermion operators
in cycle accurate simulation. The Wilson and
Clover kernels were generated using the speaker's
C++ code scheduler, while the Staggered operator
was hand coded by Calin Cristian. The codes
are very efficient even when all sites are on
multiple boundaries.

Operation         Local Vol.    Mflops/node
Wilson D_eo       2^4           470
Wilson D_eo       4^4           535
Clover D_eo       2^4           560
Clover D_eo       4^4           590
Staggered D_eo    2^4           370
Staggered D_eo    2^2 x 4^2     430


We will use a lean node kernel to avoid sched- 
uler overhead and use the PowerPC virtual mem- 
ory (VM) hardware to protect, but not translate, 
memory pages. This gives both the benefits of ro- 
bust and graceful error recovery, and the benefits 
of zero-copy SCU transfers without VM gymnas- 
tics. 

Host operating system 

The host for QCDOC will be an SMP 
Unix server with multiple Gigabit links to the 
Boot/Diagnostic/IO Ethernet network. A multi- 
threaded qdaemon will boot and manage the ma- 
chine partitions, and service socket-based connec- 
tions to these partitions from a number of client 
programs. Sufficient functionality has already 
been implemented to boot and download code to 
PowerPC boards in parallel. 

Performance in simulation 

As part of the design verification programme a
number of benchmarks and stress tests have been
written, both to check functional correctness and
to confirm that design goals are met. We shall
present performance figures for some of these
tests, where we have taken a "nominal" 500 MHz
CPU, and made reasonable assumptions about the
interconnect wire length.

Figure 1. Estimated performance per node
for assembly coded Wilson dslash operators as a
function of local volume. Roughly two orders of
magnitude better scalability is observed for QCDOC.

Table 1 shows a sample of common vector
operations performed in Lattice QCD codes. Most
of these figures were obtained using a C++
assembler code generator written by the speaker to
automate loop unrolling, prefetching and detailed
scheduling.
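As a rough check on the Table 1 figures, the sketch below converts them to fractions of the raw peak implied by a 500 MHz clock and two flops per cycle. It also estimates the memory traffic implied by the DAXPY figure, assuming the usual definition y = a*x + y in double precision (two 8-byte loads and one 8-byte store per two flops). That definition is an assumption here, since the text does not spell out the kernel.

```python
CLOCK_HZ = 500e6        # "nominal" clock assumed in the simulations
FLOPS_PER_CLOCK = 2     # one fused multiply-add per cycle
PEAK_MFLOPS = CLOCK_HZ * FLOPS_PER_CLOCK / 1e6   # 1000 Mflops raw peak

# Mflops/node values copied from Table 1.
table1_mflops = {
    "SU3-SU3": 800,
    "SU3-2spinor": 780,
    "DAXPY": 190,
    "ZAXPY": 450,
    "DAXPY-Norm": 350,
    "Clover Term/asm": 790,
    "Clover Term/gcc": 150,
    "Clover Term/xlc": 300,
}

fraction_of_peak = {
    op: mflops / PEAK_MFLOPS for op, mflops in table1_mflops.items()
}

# Assumed DAXPY cost model: y = a*x + y moves 24 bytes for every 2 flops.
DAXPY_BYTES_PER_FLOP = 24 / 2
daxpy_gbytes_per_s = table1_mflops["DAXPY"] * 1e6 * DAXPY_BYTES_PER_FLOP / 1e9

for op, frac in fraction_of_peak.items():
    print(f"{op:16s} {frac:5.0%} of raw peak")
print(f"DAXPY implies about {daxpy_gbytes_per_s:.2f} GByte/s of memory traffic")
```

Under these assumptions the low DAXPY figure reflects a kernel limited by memory traffic rather than by the FPU, while the SU3 kernels sit close to the raw peak.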

Table 2 shows performance measurements of
optimised implementations of various preconditioned
Fermion operators in double precision. In
these operators unpaired adds and multiplies set
an upper limit (e.g. 780 Mflops for the Wilson
kernel) somewhat below the "raw" peak. The
communication overhead is entirely taken into
account, and the performance is excellent despite
having to communicate each site over four wires
in the 2^4 case.
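The Wilson entries of Table 2 can be related both to the 780 Mflops limit and to a time per operator application. The sketch below assumes the conventional count of 1320 floating point operations per site for the Wilson operator, and assumes that the red-black operator D_eo produces output on half of the local sites. Neither assumption is stated in the text, and the function name is hypothetical.

```python
WILSON_FLOPS_PER_SITE = 1320   # conventional count, an assumption here
FMA_LIMIT_MFLOPS = 780         # Wilson upper limit quoted in the text


def wilson_deo_summary(local_extent, mflops):
    """Return (fraction of the 780 Mflops limit, time per application in us).

    local_extent: side length L of an L^4 local volume.
    mflops:       sustained Mflops/node from Table 2.
    """
    sites = local_extent ** 4
    output_sites = sites // 2          # assumed: one checkerboard is written
    flops = output_sites * WILSON_FLOPS_PER_SITE
    time_us = flops / mflops           # flops / (1e6 flops/s) is microseconds
    return mflops / FMA_LIMIT_MFLOPS, time_us


frac_2, time_2 = wilson_deo_summary(2, 470)   # 2^4 local volume
frac_4, time_4 = wilson_deo_summary(4, 535)   # 4^4 local volume

print(f"2^4: {frac_2:.0%} of the limit, about {time_2:.1f} us per application")
print(f"4^4: {frac_4:.0%} of the limit, about {time_4:.1f} us per application")
```

With these assumptions the 2^4 case comes out at a little over 22 microseconds per application, the same order as the roughly 20 microseconds quoted in the Scalability section.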

Scalability 

Performance characteristics at small local
volumes are critical to scaling. Figure 1 shows the
estimated performance of QCDOC on double
precision Wilson dslash code (omitting global
summation time) distributed in four dimensions. At
the smallest volumes the entire operation takes
roughly 20 µs, with eight communications in this
time. For comparison we overlay a Myrinet Alpha
21264 cluster running single precision assembler
code distributed in three dimensions.

Table 3

Estimates for Wilson CG performance on a
32^3 x 64 Lattice. Both Clover and Domain Wall
simulations would be even more scalable.

Nodes    M†M        Gsum       Sust. Tflops
4096     2620 µs    10 µs      2.15
8192     1310 µs    11.5 µs    4.2
16384    680 µs     13 µs      8.1
32768    340 µs     15 µs      15.6

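The Table 3 estimates can be restated per node. The sketch below derives the number of local sites, the sustained Mflops per node, and the strong-scaling efficiency relative to the 4096-node machine. All inputs are taken from Table 3; the derived quantities and variable names are illustrative.

```python
GLOBAL_SITES = 32**3 * 64   # 32^3 x 64 lattice, 2,097,152 sites

# (nodes, sustained Tflops) copied from Table 3.
table3 = [(4096, 2.15), (8192, 4.2), (16384, 8.1), (32768, 15.6)]

base_nodes, base_tflops = table3[0]
rows = []
for nodes, tflops in table3:
    sites_per_node = GLOBAL_SITES // nodes
    mflops_per_node = tflops * 1e6 / nodes
    # Speed-up achieved divided by the increase in node count.
    efficiency = (tflops / base_tflops) / (nodes / base_nodes)
    rows.append((nodes, sites_per_node, mflops_per_node, efficiency))

for nodes, sites, mflops, eff in rows:
    print(f"{nodes:6d} nodes: {sites:4d} sites/node, "
          f"{mflops:6.1f} Mflops/node, {eff:.1%} scaling efficiency")
```

Going from 4096 to 32768 nodes shrinks the local volume from 512 to 64 sites, yet the sustained rate per node only falls from about 525 to about 476 Mflops, which corresponds to roughly 91% strong-scaling efficiency.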

Given the fast linear algebra and the scalable
matrix multiply, the remaining hurdle to be
overcome by an iterative solver is global summation.
The SCU has "pass-thru" hardware assist for
global sums and broadcasts. The global sum has
been benchmarked in simulation, and we use the
mixture of matrix multiplies, linear algebra and
global sums in the CG algorithm to predict the
performance of Wilson HMC on large machines
in Table 3.
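The mixture of operations in a CG iteration can be seen in a minimal sketch of the textbook conjugate gradient algorithm. This is not the QCDOC solver: a small symmetric positive definite matrix stands in for the fermion matrix product, and the counters and function names are hypothetical. Each inner product is counted as a global sum, since on a parallel machine it requires a reduction over all nodes.

```python
def conjugate_gradient(apply_a, b, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive definite A, counting operations."""
    counts = {"matvec": 0, "gsum": 0, "axpy": 0}

    def dot(u, v):
        counts["gsum"] += 1      # a global sum on a parallel machine
        return sum(ui * vi for ui, vi in zip(u, v))

    def axpy(a, x, y):
        counts["axpy"] += 1      # purely local linear algebra
        return [a * xi + yi for xi, yi in zip(x, y)]

    x = [0.0] * len(b)
    r = list(b)                  # residual b - A x for x = 0
    p = list(b)                  # search direction
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr < tol * tol:
            break
        ap = apply_a(p)          # the matrix multiply
        counts["matvec"] += 1
        alpha = rr / dot(p, ap)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, ap, r)
        rr_new = dot(r, r)
        p = axpy(rr_new / rr, p, r)
        rr = rr_new
    return x, counts


# Stand-in operator: a 2 x 2 symmetric positive definite matrix.
A = [[4.0, 1.0], [1.0, 3.0]]


def apply_a(v):
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in A]


solution, op_counts = conjugate_gradient(apply_a, [1.0, 2.0])
print(solution, op_counts)
```

Each iteration performs one matrix multiply, two global sums and three vector updates, which is why the global sum latency matters once the matrix multiply time has shrunk to a few hundred microseconds.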

Conclusions 

QCDOC is progressing well, with large machines
due in 2003. Simulations indicate that QCD will
run at high efficiency even on the largest
machines, with a low cost per sustained Megaflop
and low power consumption.